Recently, multi-core architectures with alternative memory subsystem designs have emerged. Instead of using hardware-managed cache hierarchies, they employ software-managed embedded memory. An open question is which programming and compilation methods are effective at exploiting the performance potential of this new class of architectures. Using LU decomposition as a case study, we propose three techniques that, combined, achieve a 27-times speedup on the IBM Cyclops-64 many-core architecture compared to the parallel LU implementation from the SPLASH-2 benchmark suite. Our first method allows adaptive load distribution, which maximizes load balance among cores – this is important to leverage the potential of the next two methods. Secondly, ...
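The abstract above does not give details of its adaptive scheme, but the load-balance problem it targets is easy to illustrate: in LU decomposition, eliminating column k of an n×n matrix costs roughly (n−k)² updates, so assigning contiguous blocks of columns to cores overloads the cores that own early columns. A minimal sketch (my own illustration, not the paper's code) comparing a contiguous-block schedule with a round-robin (cyclic) schedule:

```python
# Why column-to-core assignment matters for LU load balance.
# Eliminating column k of an n x n matrix costs ~(n - k)^2 updates,
# so early columns are far more expensive than late ones.

def work_per_core(n, cores, owner):
    """Sum the per-column cost (n - k)^2 over the columns each core owns."""
    totals = [0] * cores
    for k in range(n):
        totals[owner(k)] += (n - k) ** 2
    return totals

def imbalance(totals):
    """Max core load divided by the mean load (1.0 = perfect balance)."""
    return max(totals) / (sum(totals) / len(totals))

n, cores = 1024, 8
block  = work_per_core(n, cores, lambda k: k * cores // n)  # contiguous blocks
cyclic = work_per_core(n, cores, lambda k: k % cores)       # round-robin

print(f"block  imbalance: {imbalance(block):.2f}x")   # core 0 heavily loaded
print(f"cyclic imbalance: {imbalance(cyclic):.2f}x")  # near-perfect balance
```

With a block schedule the most-loaded core does roughly 2.6× the average work, while the cyclic schedule stays within a few percent of perfect balance; an adaptive scheme can go further by redistributing work as the remaining matrix shrinks.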
Low-power embedded processors utilize compact instruction encodings to achieve small code size. Inst...
The power, frequency, and memory wall problems have caused a major shift in mainstream computing by ...
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order ...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
Abstract. Power consumption and energy efficiency have become a major bottleneck in the design of n...
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have ev...
Abstract—Dense LU factorization is a prominent benchmark used to rank the performance of supercomput...
Recently, multi-core chips have become omnipresent in computer systems ranging from high-end server...
Tiled architectures have emerged as a solution to translate an increasing number of transistors into...
The level of Thread-Level Parallelism (TLP), Instruction-Level Parallelism (ILP), and Memory-Lev...
Programming of commodity multicore processors is a challenging task and it becomes even harder when ...
Nowadays, we are reaching a point where further improving single thread performance can only be done...
Multicore designers often add a small local memory close to each core to speed up access and to redu...
The number of cores in multicore computers has an irreversible tendency to increase. Also, computers...
Gao, Guang R. The research proposed in this thesis will provide an analysis of these new scenarios, p...